Abstract
We continue the work of Sobel on axioms for preferences in discrete Markov processes. Sufficient conditions for optimality are presented, and the logical interrelation with previous axiomatizations is discussed.


Axioms and Examples Related to Ordinal Dynamic Programming

by Charles E. Blair

We consider deterministic sequential Markov processes. Let $X$ be a set of states. For each $x \in X$, $M(x) \subseteq X$ is the set of states that can be reached in one step from $x$. Define $\Delta$ to be the set of mappings $\delta: X \to X$ such that $\delta(x) \in M(x)$ for every $x \in X$. A policy is an infinite sequence $\delta_1 \delta_2 \ldots$ where $\delta_i \in \Delta$. A stationary policy has all $\delta_i$ equal; we write $\delta^\infty$ for the stationary policy $\delta\delta\delta\ldots$.

For each policy $\pi = \delta_1 \delta_2 \ldots$ and each $x \in X$ there is a unique sequence $x_0 x_1 x_2 \ldots$ such that $x_0 = x$ and $x_n = \delta_n(x_{n-1})$, $n = 1, 2, \ldots$. We will denote this sequence by $P(\pi, x)$. For $x \in X$, $\Phi_x$ is defined to be the set of sequences $P(\pi, x)$ that arise as $\pi$ varies over all possible policies; $\Phi_x$ is thus the set of all posterities with initial state $x$.
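The definitions of a policy and of $P(\pi, x)$ can be sketched in Python. This is a minimal illustration, not part of the original paper: it assumes a finite state set, stores each mapping $\delta \in \Delta$ as a dict, represents a policy as a function from $n = 1, 2, \ldots$ to $\delta_n$, and produces only finitely many terms of the infinite posterity. The names `is_decision`, `posterity`, and `first_terms` are hypothetical.

```python
from itertools import islice
from typing import Callable, Dict, Iterator, List, Set

Decision = Dict[int, int]           # a mapping delta: X -> X, stored as a dict
Policy = Callable[[int], Decision]  # n -> delta_n for n = 1, 2, ...


def is_decision(delta: Decision, M: Dict[int, Set[int]]) -> bool:
    """True if delta is defined on all of X and delta(x) is in M(x) for every x."""
    return set(delta) == set(M) and all(delta[x] in M[x] for x in M)


def posterity(pi: Policy, x: int) -> Iterator[int]:
    """Yield x_0 x_1 x_2 ... with x_0 = x and x_n = delta_n(x_{n-1})."""
    n, current = 1, x
    yield current
    while True:
        current = pi(n)[current]
        n += 1
        yield current


def first_terms(pi: Policy, x: int, length: int) -> List[int]:
    """The first `length` terms of the (infinite) posterity P(pi, x)."""
    return list(islice(posterity(pi, x), length))


# Illustrative transition structure: M(0) = {0, 1}, M(1) = {1}.
M = {0: {0, 1}, 1: {1}}
delta1 = {0: 1, 1: 1}


def stationary(n: int) -> Decision:
    """The stationary policy delta1 delta1 delta1 ..."""
    return delta1


print(first_terms(stationary, 0, 5))  # [0, 1, 1, 1, 1]
```

Representing a policy as a function of $n$ keeps nonstationary policies as easy to express as stationary ones.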

Sobel [1] studied situations in which orderings satisfying various axioms are assigned to the sets $\Phi_x$. For $p, q \in \Phi_x$ we thus have an ordering under which either $p \ge q$ or $q \ge p$. The ordering on posterities induces a partial ordering on policies: $\pi_1 \ge \pi_2$ if and only if $P(\pi_1, x) \ge P(\pi_2, x)$ for all $x \in X$. An optimal policy $\pi$ is one such that $\pi \ge \pi'$ for all policies $\pi'$.

It was shown in [1,2] that, provided certain axioms hold for the orderings on posterities and policies, the following results hold:

(1) If there exists an optimal policy, then there exists an optimal stationary policy.

(2) If $\pi = \delta_1 \delta_2 \delta_3 \ldots$ and $\pi \ge \delta \delta_1 \delta_2 \delta_3 \ldots = \delta\pi$ for every $\delta \in \Delta$, then $\pi$ is optimal.

(3) If $X$ is finite there is a stationary optimal policy.
We follow [1] in assuming throughout that the orderings on $\Phi_x$ satisfy

(4) if $p_1, p_2 \in \Phi_x$ and $x_0 \ldots x_n$ is a sequence such that $x_i \in M(x_{i-1})$ for $1 \le i \le n$, and $x \in M(x_n)$, then $x_0 \ldots x_n p_1 \ge x_0 \ldots x_n p_2$ if and only if $p_1 \ge p_2$.

Here $x_0 \ldots x_n p$ is the sequence of states formed by concatenating $x_0 \ldots x_n$ and $p$. The hypotheses imply that these two sequences are members of $\Phi_{x_0}$. The intuitive content is that if one sequence is preferable to another when $x$ is the starting state, then the same holds if $x$ is reached at a later time.

(4) is satisfied by most criteria that one would want to use in a dynamic programming problem. However, additional assumptions must be made in order to obtain (1)-(3).

[1] proposes the "countable transitivity" axiom

(5) Let $p_i \in \Phi_x$, $i = 0, 1, 2, \ldots$. If for $i \ge 1$ the first $i$ terms of $p_i$ coincide with the first $i$ terms of $p_0$, and $p_1 \le p_2 \le p_3 \le \cdots$, then $p_0 \ge p_i$ for all $i$.

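However, (4) and (5) do not imply (2).*

*This corrects Theorem 3 of [1]. Sobel had discovered this independently while writing [3]. This motivated the use of the alternative axiom (6) in [2].

Example 1: Let $X = \{0, 1\}$, $M(0) = X$, $M(1) = \{1\}$. $\Phi_1$ consists of the single posterity $111\ldots$; $\Phi_0$ consists of the posterities $000\ldots$ and $0^k 111\ldots$ for $k \ge 1$. Define $000\ldots > 0111\ldots > 00111\ldots > \cdots$, that is, $000\ldots$ is best and $0^k 111\ldots$ becomes worse as $k$ grows. (4) is easy to verify. (5) is satisfied because $p_1 \le p_2 \le p_3 \le \cdots$ implies (in this example) that $p_i = p_{i+1}$ for all sufficiently large $i$, since each strict increase either decreases $k$ or moves to $000\ldots$; the eventual common value then agrees with $p_0$ in every term, so it equals $p_0$, and $p_0 \ge p_i$ for all $i$. Let $\delta_1(0) = 1 = \delta_1(1)$. Then the policy $\pi = \delta_1 \delta_1 \delta_1 \ldots = \delta_1^\infty$ satisfies $\pi \ge \delta\pi$ for any $\delta \in \Delta$, because $P(\delta\pi, 0)$ is $0111\ldots$ or $00111\ldots$ according as $\delta(0) = 1$ or $\delta(0) = 0$, and $\Phi_1$ has only one element. But if $\delta_2(0) = 0$ (necessarily $\delta_2(1) = 1$) and $\pi' = \delta_2^\infty$, then $P(\pi, 0) = 0111\ldots$ and $P(\pi', 0) = 000\ldots$, hence $\pi \not\ge \pi'$ and $\pi$ is not optimal.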
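The claims about Example 1 can be checked mechanically. In the sketch below (an illustration under stated assumptions, not from the paper), a mapping is identified with its value at $0$, since $\delta(1) = 1$ is forced, and a policy is encoded as a finite list of values $\delta_n(0)$ followed by one value repeated forever; the names `leading_zeros`, `rank`, and `geq` are hypothetical. Because $\Phi_1$ has a single posterity, comparing policies reduces to comparing posterities from state $0$.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: (head, tail) means delta_n(0) = head[n-1] for
# n <= len(head) and delta_n(0) = tail for all later n.
EncodedPolicy = Tuple[List[int], int]


def leading_zeros(policy: EncodedPolicy) -> Optional[int]:
    """Return k if P(pi, 0) = 0^k 111..., or None if P(pi, 0) = 000..."""
    head, tail = policy
    for n, d in enumerate(head, start=1):
        if d == 1:  # x_n = delta_n(0) = 1, so x_0 ... x_{n-1} were all 0
            return n
    return len(head) + 1 if tail == 1 else None


def rank(k: Optional[int]) -> int:
    """Encode 000... > 0111... > 00111... > ... (larger rank is preferred)."""
    return 0 if k is None else -k


def geq(a: EncodedPolicy, b: EncodedPolicy) -> bool:
    """pi_a >= pi_b; state 1 is irrelevant because Phi_1 is a single point."""
    return rank(leading_zeros(a)) >= rank(leading_zeros(b))


pi = ([], 1)        # delta_1^infinity with delta_1(0) = 1
pi_prime = ([], 0)  # delta_2^infinity with delta_2(0) = 0

# pi >= delta pi for both possible values of delta(0) ...
print(all(geq(pi, ([d], 1)) for d in (0, 1)))  # True
# ... yet pi' is strictly better, so pi is not optimal.
print(geq(pi, pi_prime), geq(pi_prime, pi))    # False True
```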

It can be shown that (4) and (5) imply strengthened versions of (1) and (3).

Theorem 1: Assume (4) and (5) hold. Suppose that there is a $\delta \in \Delta$ such that, for every $x \in X$, if $p \in \Phi_x$ there is a $p' \in \Phi_x$ whose first two terms are $x, \delta(x)$ with $p' \ge p$. Then $\delta^\infty$ is an optimal policy.

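Proof: Let $x \in X$, $p \in \Phi_x$. We will construct a sequence of $p_i \in \Phi_x$ such that $p_1 = p \le p_2 \le p_3 \le \cdots$ and the first $i$ members of $p_i$ coincide with the first $i$ members of $P(\delta^\infty, x)$. We start with $p_1 = p$ and continue by induction. If $p_1, \ldots, p_n$ have already been constructed, let $p_n = x_0 x_1 \ldots$. By hypothesis, there is a $q \in \Phi_{x_{n-1}}$ such that $q \ge x_{n-1} x_n \ldots$ and the first two terms of $q$ are $x_{n-1}$ and $\delta(x_{n-1})$. By (4), $p_{n+1} = x_0 x_1 \ldots x_{n-2} q \ge p_n$ (for $n = 1$ the prefix is empty and $p_2 = q$). The first $n+1$ terms of $p_{n+1}$ are $x_0 \ldots x_{n-1} \delta(x_{n-1})$, which agree with $P(\delta^\infty, x)$ because $x_0 \ldots x_{n-1}$ do. This completes the construction of the $p_i$. (5) implies that $P(\delta^\infty, x) \ge p$. Since $x$ and $p$ were arbitrary, $\delta^\infty$ is optimal. Q.E.D.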

Theorem 1 has a converse in the sense that if no $\delta$ exists satisfying the hypothesis, then no policy is optimal.

Corollary 2*: If $\pi = \delta_1 \delta_2 \delta_3 \ldots$ is an optimal policy, then $\delta_1^\infty$ is an optimal policy.

Proof: In this case, taking $\delta = \delta_1$, the $p'$ in the hypothesis of Theorem 1 is $P(\pi, x)$. Q.E.D.


*This result is established in the proof of Theorem 2 of [1].
Corollary 3: If $X$ is finite there is a stationary optimal policy.

Proof: For each $x \in X$, $\Phi_x = \bigcup_{y \in M(x)} Q_y$, where $Q_y$ consists of those posterities whose first two terms are $x, y$; since $X$ is finite, this is a finite union. If an ordered set is the union of finitely many subsets, at least one of the subsets is such that, for each point of the set, there is a point of the subset at least as large. (Otherwise, for each subset $S_j$ there would be a point $a_j$ of the set strictly larger than every point of $S_j$; the largest of the $a_j$ would belong to some $S_k$ and so be strictly larger than itself.) If $\delta(x)$ is chosen so that $Q_{\delta(x)}$ is such a subset, then the hypothesis of Theorem 1 is satisfied and $\delta^\infty$ is a stationary optimal policy.
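The finite-union fact used in this proof can be illustrated on finite data. The sketch below is a hypothetical helper, not from the paper: it assumes each piece $Q_y$ has been replaced by a finite list of stand-ins for posterities, compared by a caller-supplied total preorder `geq`, and it returns a key $y$ whose piece is cofinal, i.e. a candidate value for $\delta(x)$. For genuinely infinite $\Phi_x$ the argument in the proof is still needed; the code only checks the property on the listed elements.

```python
from typing import Callable, Dict, Hashable, List, TypeVar

T = TypeVar("T")


def is_cofinal(piece: List[T], whole: List[T], geq: Callable[[T, T], bool]) -> bool:
    """Every element of `whole` is matched by some element of `piece` at least as large."""
    return all(any(geq(q, p) for q in piece) for p in whole)


def choose_cofinal(pieces: Dict[Hashable, List[T]], geq: Callable[[T, T], bool]) -> Hashable:
    """Return a key y whose piece Q_y is cofinal in the union of all pieces."""
    whole = [p for piece in pieces.values() for p in piece]
    for y, piece in pieces.items():
        if is_cofinal(piece, whole, geq):
            return y
    # Only reached if `pieces` is empty or `geq` is not a total preorder.
    raise ValueError("no cofinal piece found")


# Stand-ins scored by (v1, v2); Python compares tuples lexicographically.
pieces = {0: [(1, 0), (0, 5)], 1: [(1, 1), (0, 0)]}
print(choose_cofinal(pieces, lambda a, b: a >= b))  # 1
```

Checking each piece directly mirrors the statement in the proof rather than the shortcut of locating a maximum, which need not exist in an infinite $\Phi_x$.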

An alternative to (5) was proposed in [2]:

(6) Let $\pi = \delta_1 \delta_2 \ldots$ and $\xi$ be two policies. Then $\xi \ge \delta_1 \cdots \delta_k \xi$ for all $k$ implies $\xi \ge \pi$, and $\xi \le \delta_1 \cdots \delta_k \xi$ for all $k$ implies $\xi \le \pi$.

(4) and (6) together imply (1), (2) and (3). However, there are two objections to (6). First, it is stated in terms of the partial ordering on policies rather than the total ordering on posterities, and is thus somewhat indirect. Second, (6) excludes lexicographic discounted-return criteria, a fairly natural class of preference orderings (Example 3 of [1]).

Example 2: Let $X = \{0, 1\}$, $M(0) = M(1) = X$. For a posterity $p = x_0 x_1 x_2 \ldots$ define

$$v_i(p) = \sum_{n=1}^{\infty} \left(\tfrac{1}{2}\right)^n r_i(x_{n-1}, x_n), \qquad i = 1, 2,$$

where $r_1(0,0) = 1$, $r_1(1,1) = 2$, $r_1(0,1) = r_1(1,0) = 0$, and $r_2(0,1) = 1$, $r_2(0,0) = r_2(1,1) = r_2(1,0) = 0$. For $p, p' \in \Phi_x$, $p \ge p'$ iff $v_1(p) > v_1(p')$, or $v_1(p) = v_1(p')$ and $v_2(p) \ge v_2(p')$. Let $\xi = \delta_1^\infty$, where $\delta_1(0) = \delta_1(1) = 0$. Let $\pi = \delta_2 \delta_3^\infty$, where $\delta_2(0) = 1$, $\delta_2(1) = 0$; $\delta_3(0) = 0$, $\delta_3(1) = 1$. Then $v_1(P(\xi, 0)) = v_1(000\ldots) = 1$, $v_1(P(\xi, 1)) = v_1(1000\ldots) = \tfrac12$, and $v_1(P(\delta_2\xi, 0)) = v_1(01000\ldots) = \tfrac14 < v_1(P(\xi, 0))$. Since $P(\delta_2\xi, 1) = P(\xi, 1)$, it follows that $\xi \ge \delta_2\xi$. Similarly, it can be verified that $\xi \ge \delta_2\delta_3^k\xi$ for every positive $k$. Since $v_1(P(\pi, 0)) = v_1(0111\ldots) = 1 = v_1(P(\xi, 0))$ and $v_2(P(\pi, 0)) = \tfrac12 > v_2(P(\xi, 0)) = 0$, it follows that $\xi \not\ge \pi$, which contradicts (6).
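The values in Example 2 can be recomputed with exact rational arithmetic. This sketch is an illustration only: it assumes every posterity of interest ends at a fixed point of the last mapping applied (true for $\delta_1$ and $\delta_3$ here), represents a policy as a finite list of mappings followed by one mapping repeated forever, and checks $\xi \ge \delta_2\delta_3^k\xi$ only for finitely many $k$. The helper names `run`, `value`, `key`, and `geq` are hypothetical; Python's tuple comparison supplies the lexicographic order.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

Decision = Dict[int, int]
BETA = Fraction(1, 2)
R1 = {(0, 0): 1, (1, 1): 2, (0, 1): 0, (1, 0): 0}
R2 = {(0, 1): 1, (0, 0): 0, (1, 1): 0, (1, 0): 0}
D1 = {0: 0, 1: 0}  # delta_1
D2 = {0: 1, 1: 0}  # delta_2
D3 = {0: 0, 1: 1}  # delta_3


def run(decisions: List[Decision], tail: Decision, x: int) -> List[int]:
    """Finite path x_0 ... x_m of P(d_1 ... d_j tail^inf, x), ending at a fixed
    point c of `tail`; the full posterity is this path followed by c c c ..."""
    path = [x]
    for d in decisions:
        path.append(d[path[-1]])
    for _ in range(len(tail) + 1):
        nxt = tail[path[-1]]
        if nxt == path[-1]:
            return path
        path.append(nxt)
    raise ValueError("tail mapping does not settle at a fixed point")


def value(r: Dict[Tuple[int, int], int], path: List[int]) -> Fraction:
    """v(p) = sum_{n>=1} BETA^n r(x_{n-1}, x_n), where p = path then path[-1] forever."""
    total = sum((BETA ** n * r[(path[n - 1], path[n])] for n in range(1, len(path))), Fraction(0))
    c, m = path[-1], len(path) - 1
    # The remaining transitions (c, c) occur at n = m+1, m+2, ...
    return total + r[(c, c)] * BETA ** (m + 1) / (1 - BETA)


def key(path: List[int]) -> Tuple[Fraction, Fraction]:
    return (value(R1, path), value(R2, path))  # compared lexicographically


def geq(a, b) -> bool:
    """Policy a >= policy b: compare posterities from every start state."""
    return all(key(run(*a, x)) >= key(run(*b, x)) for x in (0, 1))


xi = ([], D1)
pi = ([D2], D3)
print(value(R1, run(*xi, 0)), value(R1, run(*xi, 1)))          # 1 1/2
print(all(geq(xi, ([D2] + [D3] * k, D1)) for k in range(10)))  # True
print(geq(xi, pi))                                              # False
```

The last line fails because, from state 0, $\xi$ scores $(1, 0)$ while $\pi$ scores $(1, \tfrac12)$.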

An alternative to (6) is the "dual" of (5):

(5') Let $p_i \in \Phi_x$, $i = 0, 1, 2, \ldots$. If for $i \ge 1$ the first $i$ terms of $p_i$ coincide with the first $i$ terms of $p_0$, and $p_1 \ge p_2 \ge p_3 \ge \cdots$, then $p_0 \le p_i$ for all $i$.

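Theorem 2: (4) and (5') imply (2).

Proof: Suppose $\pi \ge \delta\pi$ for every $\delta \in \Delta$, and let $\xi = \delta_1 \delta_2 \delta_3 \ldots$ and $x \in X$. Then repeated application of (4) gives $\pi \ge \delta_1\pi \ge \delta_1\delta_2\pi \ge \delta_1\delta_2\delta_3\pi \ge \cdots$, hence $P(\pi, x) \ge P(\delta_1\pi, x) \ge P(\delta_1\delta_2\pi, x) \ge \cdots$. The $i$-th sequence in this chain, $P(\delta_1 \cdots \delta_{i-1}\pi, x)$, agrees with $P(\xi, x)$ in its first $i$ terms. Hence (5') implies $P(\pi, x) \ge P(\xi, x)$. Since $x$ and $\xi$ were arbitrary, this implies $\pi$ is optimal. Q.E.D.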
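The step that lets (5') apply is the prefix property of the chain. A small sketch can confirm it for particular policies; the dict/function representation and the helper names `posterity`, `first_terms`, and `prepend` are assumptions of this illustration, and the mappings are borrowed from Example 2 purely as test data (they are not claimed to satisfy $\pi \ge \delta\pi$).

```python
from itertools import islice
from typing import Callable, Dict, Iterator, List

Decision = Dict[int, int]
Policy = Callable[[int], Decision]  # n -> delta_n for n = 1, 2, ...


def posterity(pi: Policy, x: int) -> Iterator[int]:
    """Yield x_0 x_1 ... with x_0 = x and x_n = delta_n(x_{n-1})."""
    n, current = 1, x
    yield current
    while True:
        current = pi(n)[current]
        n += 1
        yield current


def first_terms(pi: Policy, x: int, length: int) -> List[int]:
    return list(islice(posterity(pi, x), length))


def prepend(ds: List[Decision], pi: Policy) -> Policy:
    """The policy d_1 ... d_k pi."""
    k = len(ds)
    return lambda n: ds[n - 1] if n <= k else pi(n - k)


D1, D2, D3 = {0: 0, 1: 0}, {0: 1, 1: 0}, {0: 0, 1: 1}
xi: Policy = lambda n: D2 if n % 2 else D1  # an arbitrary nonstationary xi
pi: Policy = lambda n: D3                   # stands in for the policy pi

for x in (0, 1):
    for i in range(1, 12):
        chain_i = prepend([xi(n) for n in range(1, i)], pi)  # delta_1 ... delta_{i-1} pi
        assert first_terms(chain_i, x, i) == first_terms(xi, x, i)
print("prefix property holds for i = 1..11")
```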

Corollary: If the orderings on $\Phi_x$ are given by lexicographic discounted-return criteria, then (1), (2), (3) hold.

Proof: It suffices to verify that (5) and (5') both hold. This is easily established by noting that $v_i(p_0) = \lim_{n \to \infty} v_i(p_n)$. Q.E.D.
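The limit in this proof can be made quantitative. Assume, as a hypothesis of this sketch rather than something stated above, that every reward satisfies $|r_i| \le R$ and that the discount factor is $\beta \in (0,1)$ (Example 2 uses $\beta = \tfrac12$). If two posterities $p = x_0 x_1 \ldots$ and $p' = x_0' x_1' \ldots$ share their first $N$ terms, the terms with $n \le N-1$ cancel, and

```latex
\left| v_i(p) - v_i(p') \right|
  \le \sum_{n=N}^{\infty} \beta^{n} \left| r_i(x_{n-1}, x_n) - r_i(x'_{n-1}, x'_n) \right|
  \le 2R \sum_{n=N}^{\infty} \beta^{n}
  = \frac{2R\,\beta^{N}}{1-\beta}.
```

Applied to $p = p_n$ and $p' = p_0$ with $N = n$, the bound tends to $0$. For (5), the values $v_1(p_n)$ are then nondecreasing with limit $v_1(p_0) \ge v_1(p_i)$; if equality holds, $v_1(p_n) = v_1(p_i)$ for all $n \ge i$, so the $v_2(p_n)$ are nondecreasing from $i$ on and the same argument applies to $v_2$. The case of (5') is symmetric.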

It seems that (5) and (5') are preferable to (6). [1] suggests that there are several problems still to be addressed in the case in which $X$ is infinite. We would like to mention this issue: in those cases in which there is no optimal stationary policy (hence no optimal policy, by (1)), when is it the case that for every policy $\pi$ there is a stationary policy $\delta^\infty$ such that $\delta^\infty \ge \pi$?